Airfare Exploration by Amir Ebrahimi

## [1] "Omitting 1485 NA records (0.92% of all records)"
##       Year         quarter         nsmiles         airport_1  
##  Min.   :2001   Min.   :1.000   Min.   : 101.0   DAL    :174  
##  1st Qu.:2005   1st Qu.:1.000   1st Qu.: 840.2   MDW    :120  
##  Median :2009   Median :2.000   Median :1194.0   JFK    : 64  
##  Mean   :2008   Mean   :2.431   Mean   :1181.6   EWR    : 56  
##  3rd Qu.:2012   3rd Qu.:3.000   3rd Qu.:1617.0   HOU    : 51  
##  Max.   :2014   Max.   :4.000   Max.   :2620.0   LGB    : 49  
##                                 NA's   :5        (Other):971  
##    airport_2     passengers            fare           carrier_lg  
##  MDW    :180   Min.   :  0.1000   Min.   :  63.95   99     :   4  
##  LGB    : 75   1st Qu.:  0.1099   1st Qu.: 216.05   CO     :   3  
##  JFK    : 66   Median :  0.2174   Median : 298.00   G4     :   2  
##  HOU    : 64   Mean   :  0.7028   Mean   : 344.06   AA     :   1  
##  SWF    : 64   3rd Qu.:  0.3297   3rd Qu.: 409.22   US     :   1  
##  PIE    : 61   Max.   :137.0000   Max.   :2133.00   (Other):   0  
##  (Other):975                                        NA's   :1474  
##     large_ms         fare_lg        carrier_low       lf_ms     
##  Min.   :0.0454   Min.   : 95.76   99     :   4   Min.   :1     
##  1st Qu.:0.0670   1st Qu.:121.76   G4     :   2   1st Qu.:1     
##  Median :0.0800   Median :196.00   AA     :   1   Median :1     
##  Mean   :0.4656   Mean   :256.97   3M     :   0   Mean   :1     
##  3rd Qu.:1.0000   3rd Qu.:322.75   9K     :   0   3rd Qu.:1     
##  Max.   :1.0000   Max.   :618.00   (Other):   0   Max.   :1     
##  NA's   :1478     NA's   :1478     NA's   :1478   NA's   :1482  
##     fare_low     
##  Min.   : 95.76  
##  1st Qu.:102.64  
##  Median :109.52  
##  Mean   :113.09  
##  3rd Qu.:121.76  
##  Max.   :134.00  
##  NA's   :1482

Univariate Plots Section

##  [1] "Year"        "quarter"     "nsmiles"     "airport_1"   "airport_2"  
##  [6] "passengers"  "fare"        "carrier_lg"  "large_ms"    "fare_lg"    
## [11] "carrier_low" "lf_ms"       "fare_low"    "haul"
## Classes 'tbl_df' and 'data.frame':   160258 obs. of  14 variables:
##  $ Year       : num  2001 2001 2001 2001 2001 ...
##  $ quarter    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ nsmiles    : int  1300 151 1308 1347 1351 160 1339 646 654 963 ...
##  $ airport_1  : Factor w/ 295 levels "ABE","ABI","ABQ",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ airport_2  : Factor w/ 299 levels "ATL","AUS","AVL",..: 48 51 54 64 106 113 115 160 195 278 ...
##  $ passengers : num  0.889 9.444 41.667 0.222 2.222 ...
##  $ fare       : num  130 298 355 175 322 ...
##  $ carrier_lg : Factor w/ 39 levels "3M","99","9K",..: 9 33 33 9 11 32 33 24 32 33 ...
##  $ large_ms   : num  0.75 0.965 0.421 1 0.7 ...
##  $ fare_lg    : num  149 304 337 175 358 ...
##  $ carrier_low: Factor w/ 49 levels "3M","99","9K",..: 13 42 42 11 29 40 42 29 40 42 ...
##  $ lf_ms      : num  0.25 0.965 0.421 1 0.2 ...
##  $ fare_low   : num  74.5 304.1 337.4 174.5 129.3 ...
##  $ haul       : Factor w/ 2 levels "short","medium": 2 1 2 2 2 1 2 2 2 2 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:1485] 33 95 194 266 270 427 430 432 441 579 ...
##   .. ..- attr(*, "names")= chr [1:1485] "33" "95" "194" "266" ...
## $airport_1
##   [1] "ABE" "ABI" "ABQ" "ABY" "ACK" "ACT" "ACV" "ACY" "AEX" "AGS" "ALB"
##  [12] "ALO" "ALW" "AMA" "APF" "ART" "ASE" "ATL" "ATW" "AUS" "AUW" "AVL"
##  [23] "AVP" "AZA" "AZO" "BDL" "BFL" "BGM" "BGR" "BHB" "BHM" "BIL" "BIS"
##  [34] "BJI" "BKG" "BLI" "BLV" "BMI" "BNA" "BOI" "BOS" "BPT" "BQK" "BRO"
##  [45] "BTM" "BTR" "BTV" "BUF" "BUR" "BWI" "BZN" "CAE" "CAK" "CEC" "CHA"
##  [56] "CHI" "CHO" "CHS" "CIC" "CID" "CLD" "CLE" "CLL" "CLT" "CMH" "CMI"
##  [67] "CMX" "COD" "COS" "COU" "CPR" "CRP" "CRW" "CSG" "CVG" "CWA" "CYS"
##  [78] "DAB" "DAL" "DAY" "DBQ" "DCA" "DEC" "DEN" "DFW" "DHN" "DIK" "DLH"
##  [89] "DRO" "DSM" "DTW" "EAT" "EAU" "ECP" "EFD" "EGE" "EKO" "ELM" "ELP"
## [100] "ERI" "EUG" "EVV" "EWN" "EWR" "EYW" "FAR" "FAT" "FAY" "FCA" "FLG"
## [111] "FLL" "FLO" "FNL" "FNT" "FOE" "FSD" "FSM" "FWA" "FYV" "GCC" "GEG"
## [122] "GFK" "GJT" "GNV" "GPT" "GRB" "GRK" "GRR" "GSO" "GSP" "GTF" "GTR"
## [133] "GUC" "GYY" "HDN" "HFD" "HGR" "HHH" "HLN" "HOB" "HOU" "HPN" "HRL"
## [144] "HSV" "HTS" "HVN" "HYA" "IAD" "IAG" "IAH" "ICT" "IDA" "ILE" "ILG"
## [155] "ILM" "IND" "INL" "IPL" "IPT" "ISN" "ISP" "ITH" "JAC" "JAN" "JAX"
## [166] "JFK" "JLN" "JRB" "LAN" "LAS" "LAW" "LAX" "LBB" "LCH" "LCK" "LEB"
## [177] "LEX" "LFT" "LGA" "LGB" "LIT" "LMT" "LNK" "LRD" "LSE" "LWS" "LYH"
## [188] "MAF" "MBS" "MCI" "MCO" "MDT" "MDW" "MEI" "MEM" "MFE" "MFR" "MGM"
## [199] "MHK" "MHT" "MIA" "MKC" "MKE" "MKG" "MLB" "MLI" "MLU" "MMH" "MOB"
## [210] "MOD" "MOT" "MQT" "MRY" "MSN" "MSO" "MSP" "MSY" "MTJ" "MVY" "MYR"
## [221] "NYC" "OAJ" "OAK" "OKC" "OMA" "ONT" "ORD" "ORF" "ORH" "ORL" "OTH"
## [232] "PAH" "PBG" "PBI" "PDX" "PFN" "PGV" "PHF" "PHL" "PHX" "PIA" "PIE"
## [243] "PIH" "PIT" "PLN" "PNS" "PSC" "PSM" "PSP" "PUW" "PVD" "PWM" "RAP"
## [254] "RDD" "RDM" "RDU" "RFD" "RIC" "RNO" "ROA" "ROC" "RST" "RSW" "SAF"
## [265] "SAN" "SAT" "SAV" "SBA" "SBN" "SBP" "SBY" "SCE" "SDF" "SEA" "SFO"
## [276] "SGF" "SGU" "SHV" "SJC" "SJT" "SLC" "SMF" "SNA" "SPI" "SRQ" "STL"
## [287] "SUN" "SWF" "SYR" "TLH" "TOL" "TPA" "TRI" "TYS" "VPS"
## 
## $airport_2
##   [1] "ATL" "AUS" "AVL" "AVP" "AZA" "BDL" "BFL" "BHM" "BIL" "BKG" "BLI"
##  [12] "BMI" "BNA" "BOI" "BOS" "BTR" "BTV" "BUF" "BUR" "BWI" "BZN" "CAE"
##  [23] "CAK" "CEC" "CHA" "CHI" "CHO" "CHS" "CIC" "CID" "CLD" "CLE" "CLL"
##  [34] "CLT" "CMH" "CMI" "CMX" "COD" "COS" "COU" "CPR" "CRP" "CRW" "CSG"
##  [45] "CVG" "CWA" "DAB" "DAL" "DAY" "DBQ" "DCA" "DEC" "DEN" "DFW" "DHN"
##  [56] "DIK" "DLH" "DRO" "DSM" "DTW" "EAT" "EAU" "ECP" "EFD" "EGE" "ELM"
##  [67] "ELP" "ERI" "EUG" "EVV" "EWN" "EWR" "EYW" "FAR" "FAT" "FAY" "FCA"
##  [78] "FLG" "FLL" "FLO" "FNT" "FSD" "FSM" "FWA" "FYV" "GCC" "GCK" "GEG"
##  [89] "GFK" "GJT" "GNV" "GPT" "GRB" "GRI" "GRK" "GRR" "GSO" "GSP" "GTF"
## [100] "GTR" "GUC" "HDN" "HFD" "HHH" "HLN" "HOU" "HPN" "HRL" "HSV" "HTS"
## [111] "HVN" "HYA" "IAD" "IAG" "IAH" "ICT" "IDA" "IFP" "ILE" "ILG" "ILM"
## [122] "IND" "IPT" "ISN" "ISP" "ITH" "IYK" "JAC" "JAN" "JAX" "JFK" "JLN"
## [133] "JRB" "LAN" "LAS" "LAW" "LAX" "LBB" "LBE" "LCH" "LEB" "LEX" "LFT"
## [144] "LGA" "LGB" "LIT" "LMT" "LNK" "LRD" "LSE" "LWB" "LWS" "LYH" "MAF"
## [155] "MBS" "MCI" "MCN" "MCO" "MDT" "MDW" "MEI" "MEM" "MFE" "MFR" "MGM"
## [166] "MGW" "MHK" "MHT" "MIA" "MKC" "MKE" "MKG" "MLB" "MLI" "MLU" "MMH"
## [177] "MOB" "MOD" "MOT" "MQT" "MRY" "MSN" "MSO" "MSP" "MSY" "MTJ" "MVY"
## [188] "MYR" "NYC" "OAJ" "OAK" "OKC" "OMA" "ONT" "ORD" "ORF" "ORH" "ORL"
## [199] "OTH" "OXR" "PAH" "PBG" "PBI" "PDX" "PFN" "PGD" "PGV" "PHF" "PHL"
## [210] "PHX" "PIA" "PIE" "PIH" "PIT" "PLN" "PNS" "PQI" "PSC" "PSM" "PSP"
## [221] "PUW" "PVC" "PVD" "PVU" "PWM" "RAP" "RDD" "RDM" "RDU" "RFD" "RHI"
## [232] "RIC" "RKD" "RKS" "RNO" "ROA" "ROC" "ROW" "RST" "RSW" "SAF" "SAN"
## [243] "SAT" "SAV" "SBA" "SBN" "SBP" "SBY" "SCE" "SCK" "SDF" "SEA" "SFB"
## [254] "SFO" "SGF" "SGU" "SHV" "SJC" "SJT" "SLC" "SMF" "SMX" "SNA" "SPI"
## [265] "SPS" "SRQ" "SSI" "STC" "STL" "STS" "SUN" "SUX" "SWF" "SYR" "TEX"
## [276] "TLH" "TOL" "TPA" "TRI" "TSS" "TTN" "TUL" "TUP" "TUS" "TVC" "TWF"
## [287] "TXK" "TYR" "TYS" "UIN" "VGT" "VLD" "VPS" "WAS" "WYS" "XNA" "YKM"
## [298] "YNG" "YUM"
## 
## $carrier_lg
##  [1] "3M" "99" "9K" "AA" "AS" "AX" "AZ" "B6" "CO" "DH" "DL" "E9" "F9" "FL"
## [15] "G4" "GQ" "HP" "JI" "KS" "LH" "N7" "NJ" "NK" "NW" "OS" "PN" "RP" "SY"
## [29] "TW" "TZ" "U5" "UA" "US" "VX" "WN" "XP" "YX" "YY" "ZV"
## 
## $carrier_low
##  [1] "3M" "99" "9K" "AA" "AQ" "AS" "AX" "AZ" "B6" "BA" "CO" "DH" "DL" "E9"
## [15] "EY" "F9" "FL" "G4" "GQ" "HP" "IB" "JI" "KS" "LH" "N7" "NH" "NJ" "NK"
## [29] "NW" "OH" "OO" "OS" "PN" "RP" "SN" "SY" "TW" "TZ" "U5" "UA" "UP" "US"
## [43] "VX" "WN" "WS" "YV" "YX" "YY" "ZV"
##       Year         quarter        nsmiles       airport_1     
##  Min.   :2001   Min.   :1.00   Min.   :  67   DCA    :  7730  
##  1st Qu.:2005   1st Qu.:1.00   1st Qu.: 619   DFW    :  7608  
##  Median :2010   Median :2.00   Median : 962   EWR    :  7213  
##  Mean   :2009   Mean   :2.47   Mean   :1095   IAD    :  5596  
##  3rd Qu.:2012   3rd Qu.:3.00   3rd Qu.:1476   IAH    :  5110  
##  Max.   :2014   Max.   :4.00   Max.   :2783   JFK    :  5071  
##                                               (Other):121930  
##    airport_2        passengers            fare          carrier_lg   
##  TPA    :  8271   Min.   :   0.100   Min.   :  17.0   DL     :37638  
##  ORD    :  7902   1st Qu.:   7.609   1st Qu.: 184.3   AA     :24327  
##  LGA    :  5838   Median :  27.283   Median : 233.2   UA     :23099  
##  MDW    :  5660   Mean   : 186.968   Mean   : 242.0   US     :20919  
##  JFK    :  5183   3rd Qu.: 148.152   3rd Qu.: 288.9   WN     :19847  
##  IAD    :  4719   Max.   :7021.087   Max.   :2566.9   CO     :13325  
##  (Other):122685                                       (Other):21103  
##     large_ms         fare_lg        carrier_low        lf_ms       
##  Min.   :0.1000   Min.   :  17.0   DL     :34342   Min.   :0.0100  
##  1st Qu.:0.5083   1st Qu.: 181.6   AA     :28543   1st Qu.:0.2171  
##  Median :0.6772   Median : 231.5   UA     :19720   Median :0.4894  
##  Mean   :0.6883   Mean   : 241.8   US     :19502   Mean   :0.5293  
##  3rd Qu.:0.8965   3rd Qu.: 289.4   WN     :16374   3rd Qu.:0.8762  
##  Max.   :1.0000   Max.   :2566.9   CO     :12084   Max.   :1.0000  
##                                    (Other):29693                   
##     fare_low          haul       
##  Min.   :  12.0   short : 27889  
##  1st Qu.: 167.3   medium:132369  
##  Median : 212.9                  
##  Mean   : 223.4                  
##  3rd Qu.: 265.7                  
##  Max.   :2566.9                  
## 

Across all of the airport pair markets:

What is the shortest airport pair distance?

## Source: local data frame [1 x 3]
## 
##   airport_1 airport_2 nsmiles
## 1       MKE       ORD      67

I’m curious which airport pairs had the most passengers traveling (on average) across all quarters:

## Source: local data frame [7,824 x 3]
## 
##    airport_1 airport_2 mean_passengers
## 1        LAX       SFO        4853.820
## 2        JFK       LAX        4450.665
## 3        LGA       ORD        4016.727
## 4        FLL       LGA        3635.271
## 5        ATL       LGA        3553.825
## 6        JFK       SFO        3130.356
## 7        LAX       ORD        2823.389
## 8        LAS       LAX        2815.348
## 9        FLL       JFK        2769.823
## 10       EWR       MCO        2764.043
## ..       ...       ...             ...

Now, I’m wanting to see what the most traveled airport pairs were for any given quarter:

## Source: local data frame [7,824 x 3]
## 
##    airport_1 airport_2 max_passengers
## 1        JFK       LAX       7021.087
## 2        FLL       LGA       6134.340
## 3        LAX       SFO       5829.239
## 4        LGA       ORD       5783.587
## 5        JFK       SFO       5165.326
## 6        FLL       JFK       4974.610
## 7        BOS       LGA       4563.407
## 8        LAS       LAX       4527.000
## 9        ATL       LGA       4494.505
## 10       JFK       MCO       4246.630
## ..       ...       ...            ...
## Source: local data frame [1 x 14]
## 
##   Year quarter nsmiles airport_1 airport_2 passengers     fare carrier_lg
## 1 2014       4    2510       JFK       LAX   7021.087 412.7212         DL
## Variables not shown: large_ms (dbl), fare_lg (dbl), carrier_low (fctr),
##   lf_ms (dbl), fare_low (dbl), haul (fctr)

I’m wondering now how many unique airports there are in the dataset:

## Source: local data frame [338 x 1]
## 
##     ap
## 1  YUM
## 2  YNG
## 3  YKM
## 4  XNA
## 5  WAS
## 6  UIN
## 7  TXK
## 8  TWF
## 9  TVC
## 10 TUS
## .. ...
## Source: local data frame [338 x 4]
## 
##     ap sum_passengers avg_passengers avg_fare
## 1  ORD        3715830       339.7486 217.0218
## 2  LGA        3043005       281.5511 233.9431
## 3  DFW        2782758       269.6210 226.9230
## 4  EWR        2414071       226.1213 264.6953
## 5  JFK        2209363       215.4635 234.3288
## 6  TPA        2094884       243.4780 189.0277
## 7  LAX        2054506       549.7743 263.5185
## 8  DCA        1955809       187.9321 249.1613
## 9  MDW        1780883       190.6522 194.8835
## 10 IAH        1721053       181.3353 240.0298
## .. ...            ...            ...      ...
## Source: local data frame [338 x 4]
## 
##     ap sum_passengers avg_passengers avg_fare
## 1  ATL      1201812.8      1074.9668 192.5350
## 2  MCO      1128439.9      1068.5984 171.9714
## 3  DEN       874282.2       785.5186 202.4002
## 4  LAS      1231757.4       734.0628 196.4701
## 5  PHX       771706.4       633.5849 218.0178
## 6  SEA       643998.6       581.7512 244.1213
## 7  LAX      2054506.4       549.7743 263.5185
## 8  DTW       595011.0       549.4100 209.9226
## 9  MSP       591489.3       536.2550 234.4933
## 10 SAN       519371.6       511.1926 247.7528
## .. ...            ...            ...      ...

I wonder what the most expensive airports to travel from are:

## Source: local data frame [338 x 4]
## 
##     ap sum_passengers avg_passengers avg_fare
## 1  TSS       4.590513      0.1995875 616.5870
## 2  JRB       1.524128      0.2177326 516.5652
## 3  IYK    2960.440939     38.4472849 480.1068
## 4  DIK      80.115942     11.4451346 456.3364
## 5  ISN     394.069703     13.1356568 452.4509
## 6  SMX     227.339377      7.8392889 424.4812
## 7  SUN    3965.633287     18.0256059 377.8083
## 8  OXR     220.545458      6.3012988 372.9604
## 9  IDA   16632.282650     41.7896549 361.7234
## 10 ASE   32450.613572     49.9240209 359.4056
## .. ...            ...            ...      ...

Moving on to other variables:

## 99% 
## 493

The plot looks to have a good normal curve, but with a long tail. The 99th quantile of the dataset is just under 500, so it makes sense to set a limit on the x-axis to better view the histogram of fares

I’m curious what the outliers look like:

## Source: local data frame [1,600 x 14]
## 
##    Year quarter nsmiles airport_1 airport_2 passengers     fare carrier_lg
## 1  2013       2    2583       HPN       MRY  0.1098901 2566.860         UA
## 2  2008       1     278       DAL       XNA  0.1098901 2156.020         AA
## 3  2011       4    2508       HPN       MFR  0.1086957 1824.990         UA
## 4  2008       2    1746       EGE       JFK  0.2197802 1813.935         NW
## 5  2003       3    1377       BGM       EFD  0.1000000 1645.000         CO
## 6  2007       2     283       JFK       PHF  0.1098901 1311.990         US
## 7  2012       1    1372       DAL       SBP  0.2197802 1308.550         UA
## 8  2002       4     840       EFD       GSP  0.1000000 1292.000         CO
## 9  2011       1    1848       BUR       MGM  0.1111111 1255.980         UA
## 10 2014       1    2292       BUR       EYW  0.1111111 1248.000         DL
## ..  ...     ...     ...       ...       ...        ...      ...        ...
## Variables not shown: large_ms (dbl), fare_lg (dbl), carrier_low (fctr),
##   lf_ms (dbl), fare_low (dbl), haul (fctr)

Looks like most large fares involve large distances and I’ll want to look at the relationship between fares and distance when I get to bivariate analysis. I also notice that the average passengers per day is under 1 for those city pairs. Another interesting thing to look at is the market share for those city pairs. The DAL<->XNA pair seemed odd because of the low mileage, so I looked on Kayak to see what current prices were and they were all above $500.

I am curious about what the average fare is for airport pairs that have less than 1 person / day on average to see if it is high:

## Source: local data frame [1 x 1]
## 
##    avgfare
## 1 297.2354

Surprisingly, it isn’t as high as I thought it would be, but does fall beyond the third quartile.

Mileage between airport pairs looks somewhat multimodal with heteroskedasticity or possibly a skewed normal distribution.

Re-scaling x-axis (log10) to view the long tail better:

Now, I’m curious about the difference between the average low and large fares:

## Source: local data frame [1 x 1]
## 
##   fare_diff
## 1  18.37674
## Source: local data frame [1 x 1]
## 
##   mean(fare_diff_percentage)
## 1                 0.07012206

So, it looks like fares have a variance of 7% of the average fare between the average large and low fares.

Seems like there is a sizable count of carriers with the largest fare that have 100% market share. I’d like to compare that to the rest:

## Source: local data frame [2 x 2]
## 
##   large_ms == 1      n
## 1         FALSE 147608
## 2          TRUE  12650
##            n
## 1 0.08569996

This last one is interesting, since there is a spike at 10%, so I looked at the data to verify that it was good, which it is. The reason for this spike in the data is explained by the DOT:

Note that the “lowest fare carrier” is the carrier with the lowest average fare that has at least a 10 percent share of the traffic in the market, except for markets where only a single carrier has a 10 percent or greater share.

## Source: local data frame [7,270 x 8]
## 
##    Year quarter nsmiles airport_1 airport_2 carrier_low  lf_ms fare_low
## 1  2001       3     696       ABE       MDW          CO 0.0741      179
## 2  2001       3     655       ABE       ORD          DL 0.0144      198
## 3  2001       3    1270       ABQ       ATL          TW 0.0834      143
## 4  2001       3     744       ABQ       IAH          AA 0.0286      117
## 5  2001       3     677       ABQ       LAX          HP 0.0853       97
## 6  2001       3     719       ABQ       MCI          CO 0.0185       99
## 7  2001       3    1122       ABQ       MDW          AA 0.0152      146
## 8  2001       3     889       ABQ       OAK          HP 0.0725      119
## 9  2001       3     610       ABQ       SAT          AA 0.0714      138
## 10 2001       3     718       ALB       MDW          CO 0.0220      109
## ..  ...     ...     ...       ...       ...         ...    ...      ...

Finally, I’m curious about what percentage of carriers are both the largest and lowest fare carrier:

## Source: local data frame [1 x 1]
## 
##   n()/nrow(af)
## 1    0.5591921

Univariate Analysis

What is the structure of your dataset?

There are 160,258 records in the dataset with 13 features (Year, quarter, nsmiles, airport_1, airport_2, passengers, fare, carrier_lg, large_ms, fare_lg, carrier_low, lf_ms, and fare_low). The variables airport_1, airport_2, carrier_lg, and carrier_low are unordered factor variables.

Other observations:

  • the average miles traveled is 1095
  • the average passengers per day is 187
  • the median passengers per day is 27
  • the average of average quarterly fares is $242 and the max is $2567
  • Delta is the most common carrier for the largest and lowest fares

What is/are the main feature(s) of interest in your dataset?

The main features of interest to me in the dataset are fare, nsmiles, and passengers. I’d like to figure out what variables influence the fare. My guess is that distance between airports is a factor and that the activity between airports may have some weight.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Market share may have an impact on the average fare. My guess would be that areas that have high market share have less competition and may have higher fares.

Did you create any new variables from existing variables in the dataset?

At the time of my univariate analysis I had not created any variables. However, after performing bivariate analysis I thought it would be good to create an additional categorical variable, haul, for segmenting short and medium trips. A medium trip is anything over 500 miles.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Data cleaning was the bulk of the work during my investigation. I saw anomalies in market share, large and low fares, and erroneous data that needed to be marked as NA. Some variable naming changed across the years of the dataset and some fields were combined, so I had to fix those as well. I added the fixup code to the data loading functions. These changes were necessary in order to have consistent data to analyze.

Bivariate Plots Section

##                   Year       quarter      nsmiles    passengers
## Year        1.00000000 -0.0387753123  0.127808813  0.0140409715
## quarter    -0.03877531  1.0000000000 -0.003134517  0.0007013818
## nsmiles     0.12780881 -0.0031345172  1.000000000 -0.0144414386
## passengers  0.01404097  0.0007013818 -0.014441439  1.0000000000
## fare        0.31893655 -0.0370997782  0.457826614 -0.1829057014
## large_ms   -0.02335850 -0.0056159138 -0.343139151 -0.0690767254
## fare_lg     0.29604732 -0.0357017595  0.431890715 -0.1551365111
## lf_ms       0.08091732 -0.0161899485 -0.236037321 -0.0919099302
## fare_low    0.32965613 -0.0215216567  0.376823089 -0.1788191433
##                   fare     large_ms     fare_lg       lf_ms     fare_low
## Year        0.31893655 -0.023358503  0.29604732  0.08091732  0.329656135
## quarter    -0.03709978 -0.005615914 -0.03570176 -0.01618995 -0.021521657
## nsmiles     0.45782661 -0.343139151  0.43189072 -0.23603732  0.376823089
## passengers -0.18290570 -0.069076725 -0.15513651 -0.09190993 -0.178819143
## fare        1.00000000 -0.140722104  0.96032828 -0.07386059  0.905739991
## large_ms   -0.14072210  1.000000000 -0.13615080  0.69085587  0.004520862
## fare_lg     0.96032828 -0.136150796  1.00000000 -0.11410075  0.873874571
## lf_ms      -0.07386059  0.690855870 -0.11410075  1.00000000  0.081461978
## fare_low    0.90573999  0.004520862  0.87387457  0.08146198  1.000000000
## 
##  Pearson's product-moment correlation
## 
## data:  af$nsmiles and af$fare
## t = 206.1513, df = 160256, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4539482 0.4616877
## sample estimates:
##       cor 
## 0.4578266
## 
##  Pearson's product-moment correlation
## 
## data:  af$Year and af$fare
## t = 134.7118, df = 160256, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3145317 0.3233276
## sample estimates:
##       cor 
## 0.3189366

As predicted, there is a strong positive correlation between mileage and fare, but passengers has a negligible relationship to price. Year has a moderate positive correlation with fare.

It is interesting to see that carriers with the largest fare lose market share as the distance increases. My guess is that with the correlation between distance and fare that there is more competition across longer flights because there is more revenue to be gained.

## Source: local data frame [2 x 2]
## 
##     haul mean(passengers)
## 1  short         212.5549
## 2 medium         181.5771

Interesting to see that short haul trips have more passengers on average than medium trips (e.g. coast-to-coast travel)

Looks like 2012 was the best year for average passengers traveling, a drop in 2013, and a return in 2014.

It’s interesting to see the average fares among the most traveled airports and see which ones are higher or lower than the median fare among the group.

## Source: local data frame [6 x 3]
## 
##   larger_ap median_fare      diff
## 1       LAX    275.4786 42.290852
## 2       EWR    261.3781 28.190415
## 3       VPS    244.9723 11.784548
## 4       DCA    242.0185  8.830777
## 5       XNA    238.5907  5.402982
## 6       IAH    237.1551  3.967411

Well, looks like if you’re in any of the cities above you’re paying more than the median fare across all airports, but it also depends on where you’re going. Also, it’s worth mentioning that these airports serve plenty of markets, which could be a factor.

## 
## Call:
## lm(formula = fare ~ nsmiles, data = af)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -252.76  -50.20   -9.17   38.68 2232.60 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.741e+02  3.798e-01   458.4   <2e-16 ***
## nsmiles     6.200e-02  3.008e-04   206.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 75.7 on 160256 degrees of freedom
## Multiple R-squared:  0.2096, Adjusted R-squared:  0.2096 
## F-statistic: 4.25e+04 on 1 and 160256 DF,  p-value: < 2.2e-16

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Fare correlates strongly with the distance that is being traveled. There has also been a slow trend towards increasing prices since 2005, but it might be good to compare this to inflation to see if actual costs have increased. Passengers had a negligible relationship to fare.

Based on the \(R^2\) value, distance accounts for 20.96% of the variance in fare. Other features can be incorporated into the model to better explain the variance in price, which I will build on in multivariate analysis.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes, market share seems to drop when distance traveled increases. I think that is because there is more competition among longer distance flights, since there is more revenue to be gained.

What was the strongest relationship you found?

The strongest relationship I found was between fare and distance.

Multivariate Plots Section

It’s neat to see the cut-off of distance with DFW and ORD, which makes sense since they’re more central to the country.

## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

## Source: local data frame [1 x 1]
## 
##   short_haul_value
## 1        0.6775934
## Source: local data frame [1 x 1]
## 
##   med_haul_value
## 1      0.2301141

If you’re traveling under 500 miles, then you can expect to pay $0.68 per mile on average. If you’re traveling over 500 miles, then you can can expect to pay roughly $0.23 per mile on average.

Now, I want to see a better breakdown of fare value per distance traveled:

## 
## Formula: fare/nsmiles ~ A/sqrt(nsmiles)
## 
## Parameters:
##   Estimate Std. Error t value Pr(>|t|)    
## A   9.6914     0.0125   775.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1875 on 160257 degrees of freedom
## 
## Number of iterations to convergence: 1 
## Achieved convergence tolerance: 6.06e-11

As you can see there is a lot of variance in cost per mile for travel under 500 miles. However, as a consumer if you travel larger distances you will eventually enjoy much better costs per mile.

With the graph of fare value being non-linear, I now want to revisit fare vs miles traveled with different scales:

I now have a much better linear fit here. Let’s revisit the linear model and see if it improves:

## 
## Call:
## lm(formula = I(log10(fare)) ~ I(sqrt(nsmiles)), data = af)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.14930 -0.08221  0.00236  0.08421  1.08870 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.119e+00  1.148e-03  1846.2   <2e-16 ***
## I(sqrt(nsmiles)) 7.545e-03  3.468e-05   217.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.133 on 160256 degrees of freedom
## Multiple R-squared:  0.2279, Adjusted R-squared:  0.2279 
## F-statistic: 4.731e+04 on 1 and 160256 DF,  p-value: < 2.2e-16

\(R^2\) went up by 1.83% with this change, so we can now better account for the relationship between fares and distance between airports.

## Source: local data frame [10 x 2]
## 
##    larger_ap median_value
## 1        LAX     5.768249
## 2        TPA     5.688935
## 3        JFK     4.678094
## 4        LGA     4.536045
## 5        IAH     4.080295
## 6        MDW     4.064503
## 7        EWR     4.023544
## 8        DCA     3.921803
## 9        DFW     3.913508
## 10       ORD     3.585313

Here, we can see that LAX (Los Angeles Airport) fares enjoy the most bang for the buck (i.e. miles per dollar) among the larger airports. ORD (Chicago Airport) is among the weakest in terms of miles per dollar. You could consider this if you were moving to either of these two cities and planning to travel quite a bit.

## Source: local data frame [12 x 14]
## 
##    Year quarter nsmiles airport_1 airport_2 passengers     fare carrier_lg
## 1  2002       1    1515       LAS       ORD   2074.440 142.7000         N7
## 2  2002       2    1515       LAS       ORD   2441.640 139.4100         N7
## 3  2002       3    2066       IAD       LAS    320.970 209.5000         N7
## 4  2002       3    1515       LAS       ORD   2437.500 127.9900         N7
## 5  2001       2    2248       JFK       LAS   2223.626 169.5230         N7
## 6  2001       2    1521       LAS       MDW   1936.044 127.0379         N7
## 7  2001       3    1055       DFW       LAS   1474.000 138.0000         N7
## 8  2001       3    2248       JFK       LAS   2227.000 166.0000         N7
## 9  2001       3    1521       LAS       MDW   1628.000 123.0000         N7
## 10 2001       3    2176       LAS       MIA    684.000 158.0000         N7
## 11 2001       3    2177       LAS       PHL   1297.000 156.0000         N7
## 12 2001       3     414       LAS       SFO   2340.000  77.0000         N7
## Variables not shown: large_ms (dbl), fare_lg (dbl), carrier_low (fctr),
##   lf_ms (dbl), fare_low (dbl), haul (fctr)

National Airlines gave the best mileage per dollar fares between 2001-2002, but it might explain why they are no longer around, too.

## Source: local data frame [7 x 2]
## 
##   carrier_lg median_value
## 1         VX     7.246457
## 2         B6     6.754951
## 3         WN     5.078585
## 4         AA     4.472823
## 5         UA     4.424600
## 6         DL     3.969786
## 7         US     3.059363

Historically, it looks like Virgin America and JetBlue Airways are the two airlines that provide the best value for mileage per dollar. If you’re traveling long distances across country, it’s likely you’ll find good fares with those two airlines.

Over the years we have seen an increasing cost per mile traveled, but currently we’re experience a downward trend in cost per mile.

Back to our linear model of log10(fare) ~ sqrt(nsmiles) + additional variables:

## R-squared for model: 0.556
## [1] 0.1008139
## Don't know how to automatically pick scale for object of type AsIs. Defaulting to continuous

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

With an understanding of the strong relationship between distance and fare it made sense to me to look at other variables from that perspective. After seeing the relationship of fare value (cost per mile) to distance I thought to improve on the linear model that I had started in bivariate analysis.

Were there any interesting or surprising interactions between features?

The cost per mile graph was surprising in that the cost can go from $1 / mi at around a 250 mile distance to less than $0.40 / mi at around a 500 mile distance. The additional analysis into worst and best value airports and carriers was interesting to me.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I created a linear model using the log10 of fare and the square root of nsmiles. Other variables included are Year and airport pair (i.e. airport_1 and airport_2).

The variables of the linear model account for 55.6% of the variance in airline fares.


Final Plots and Summary

Plot One

Description One

The cost per mile for air travel has fluctated over time and was the worst in 2008. Currently, we are experiencing a downward trend in cost per mile traveled.

Plot Two

Description Two

The plot of mileage per dollar across the ten largest airports (by activity) shows that Los Angeles International Airport (LAX) and Tampa International Airport (TPA) have the best median mileage per dollar spent. O’Hare International Airport (ORD) has the worst median mileage per dollar.

The plot of mileage per dollar across the most popular airlines shows that Virgin America and JetBlue Airways provide the best value in terms of how far you can travel per dollar. US Airways (now owned by American Airlines) is among the worst in terms of mileage per dollar.

Plot Three

Description Three

The plot of fare value (i.e. cost per mile) with one-way distance between airports roughly follows an inverse square root function. The cost per mile has high variance for travel up to 500 miles.


Reflection

The DOT Domestic Airline Consumer Airfare Report has quarterly airfare data for 1996-2014. Unfortunately, only 2001-2014 had complete and usable data, so that is what was used by me to amass the 160,258 airport pair records. I started by simply cleaning and aggregating the data. Then, I explored single, double, and multiple variables in various ways to better understand the data. I was surprised that passenger activity did not have much weight on the predictive model of fare, but indirectly that may have been captured in the airport pair. The only strong positive correlation with fare was distance between airports, which was the basis of the linear model created. A secondary influence was Year, since there has been a general trend of increasing prices over time. The linear model improved once I realized that converting a non-linear relationship to a linear relationship (log10(fare) ~ sqrt(nsmiles) would better account for variance. Some limitations with this model are that it is averaged data instead of individual fares and that there are missing factors, such as how many days before the flight that the fare was purchased. If I had powerful computing at my disposal I would want to perform a similar analysis to this one on individual fare data.

Some interesting observations that I made while performing this analysis: